Multiple Linear Regression and Feature Engineering¶

DSC 40A, Fall 2022¶

In [2]:
!pip install pandas
!pip install seaborn
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
Successfully installed seaborn-0.12.0

Part 1 – Multiple Linear Regression¶


Here is a data set of sales figures from different stores.

In [3]:
data = pd.read_csv('sales.csv')
data
Out[3]:
net_sales sq_ft inventory advertising district_size competing_stores
0 231.0 3.0 294 8.2 8.200000 11
1 156.0 2.2 232 6.9 4.100000 12
2 10.0 0.5 149 3.0 4.300000 15
3 519.0 5.5 600 12.0 16.100000 1
4 437.0 4.4 567 10.6 14.100000 5
5 487.0 4.8 571 11.8 12.700000 4
6 299.0 3.1 512 8.1 10.100000 10
7 195.0 2.5 347 7.7 8.400000 12
8 20.0 1.2 212 3.3 2.100000 15
9 68.0 0.6 102 4.9 4.700000 8
10 570.0 5.4 788 17.4 12.300000 1
11 428.0 4.2 577 10.5 14.000000 7
12 464.0 4.7 535 11.3 15.000000 3
13 15.0 0.6 163 2.5 2.500000 14
14 65.0 1.2 168 4.7 3.300000 11
15 98.0 1.6 151 4.6 2.700000 10
16 398.0 4.3 342 5.5 16.000000 4
17 161.0 2.6 196 7.2 6.300000 13
18 397.0 3.8 453 10.4 13.900000 7
19 497.0 5.3 518 11.5 16.299999 1
20 528.0 5.6 615 12.3 16.000000 0
21 99.0 0.8 278 2.8 6.500000 14
22 0.5 1.1 142 3.1 1.600000 12
23 347.0 3.6 461 9.6 11.300000 6
24 341.0 3.5 382 9.8 11.500000 5
25 507.0 5.1 590 12.0 15.700000 0
26 400.0 8.6 517 7.0 12.000000 8

Using just one feature¶

Before we perform multiple linear regression, let's first perform simple linear regression. We'll try to use square footage to predict net sales; our prediction rule will be

$$ \text{predicted net sales} = w_0 + w_1 (\text{square feet}) $$
In [4]:
px.scatter(data, x='sq_ft', y='net_sales', title='Net Sales vs. Square Feet')

It seems like $w_1^*$, the optimal slope, should be positive. To find $w_0^*$ and $w_1^*$, we'll solve the normal equations.
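Recall that any minimizer of mean squared error, $\vec{w}^*$, satisfies the normal equations:

$$ X^TX \vec{w}^* = X^T \vec{y} $$

When $X^TX$ is invertible, the unique solution is $\vec{w}^* = (X^TX)^{-1}X^T\vec{y}$. Rather than computing the inverse explicitly, it is more numerically stable to solve the linear system directly, which is what the function below does.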

In [5]:
def solve_normal_equations(X, y):
    '''Returns the optimal parameter vector, w*, given a design matrix X and observation vector y.'''
    return np.linalg.solve(X.T @ X, X.T @ y)
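As a quick sanity check (on made-up synthetic data, not the sales data), we can verify that `solve_normal_equations` agrees with NumPy's built-in least squares solver, `np.linalg.lstsq`:

```python
import numpy as np

def solve_normal_equations(X, y):
    '''Returns the optimal parameter vector, w*, given a design matrix X and observation vector y.'''
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical synthetic data for the check: 5 observations, intercept plus 2 features.
X_check = np.array([[1.0, 0.0, 1.0],
                    [1.0, 1.0, 0.0],
                    [1.0, 2.0, 3.0],
                    [1.0, 3.0, 1.0],
                    [1.0, 4.0, 2.0]])
y_check = np.array([1.0, 2.0, 9.0, 6.0, 8.0])

w_normal = solve_normal_equations(X_check, y_check)
w_lstsq = np.linalg.lstsq(X_check, y_check, rcond=None)[0]

# Both methods minimize the same mean squared error, so they agree.
assert np.allclose(w_normal, w_lstsq)
```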
In [6]:
data['1'] = 1

X_one_feature_model = data[['1', 'sq_ft']]
X_one_feature_model.values
Out[6]:
array([[1.        , 3.        ],
       [1.        , 2.20000005],
       [1.        , 0.5       ],
       [1.        , 5.5       ],
       [1.        , 4.4000001 ],
       [1.        , 4.80000019],
       [1.        , 3.0999999 ],
       [1.        , 2.5       ],
       [1.        , 1.20000005],
       [1.        , 0.60000002],
       [1.        , 5.4000001 ],
       [1.        , 4.19999981],
       [1.        , 4.69999981],
       [1.        , 0.60000002],
       [1.        , 1.20000005],
       [1.        , 1.60000002],
       [1.        , 4.30000019],
       [1.        , 2.5999999 ],
       [1.        , 3.79999995],
       [1.        , 5.30000019],
       [1.        , 5.5999999 ],
       [1.        , 0.80000001],
       [1.        , 1.10000002],
       [1.        , 3.5999999 ],
       [1.        , 3.5       ],
       [1.        , 5.0999999 ],
       [1.        , 8.6       ]])
In [7]:
y = data['net_sales']
In [8]:
w_one_feature_model = solve_normal_equations(X_one_feature_model, y)
w_one_feature_model
Out[8]:
array([ 2.57700629, 85.38887328])

This is telling us that the best-fitting line to this dataset is

$$\text{predicted net sales} = 2.577 + 85.389 (\text{square feet})$$
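For example, the first store in the dataset has a sq_ft value of 3.0, so its predicted net sales are

$$ 2.577 + 85.389 \times 3.0 \approx 258.74 $$

which matches the first value in the predictions computed next.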

To get predictions for all observations in our dataset:

In [9]:
X_one_feature_model @ w_one_feature_model
Out[9]:
0     258.743626
1     190.432532
2      45.271443
3     472.215809
4     378.288057
5     412.443614
6     267.282505
7     216.049189
8     105.043658
9      53.810332
10    463.676930
11    361.210258
12    403.904694
13     53.810332
14    105.043658
15    139.199206
16    369.749178
17    224.588069
18    327.054721
19    455.138051
20    480.754689
21     70.888106
22     96.504769
23    309.976942
24    301.438063
25    438.060252
26    736.921317
dtype: float64

Let's draw a plot of our prediction rule.

In [10]:

x_range = np.linspace(0, 10)

fig = go.Figure()
fig.add_trace(go.Scatter(x = data['sq_ft'], y = y, mode = 'markers', name = 'actual'))
fig.add_trace(go.Scatter(x = x_range, 
                         y = w_one_feature_model[0] + w_one_feature_model[1] * x_range, 
                         name = 'linear prediction rule', 
                         line=dict(color='red')))

fig.update_layout(xaxis_title = 'Square Feet', yaxis_title = 'Net Sales')

It's also worth calculating the mean squared error of this prediction rule, so that we can compare it to our later prediction rules.

In [11]:
def mean_squared_error(X, y, w):
    '''Returns the mean squared error of the prediction rule with parameters w, given a design matrix X and observation vector y.'''
    return np.mean((y - X @ w) ** 2)

Using two features¶

Let's now try to predict net sales from two variables: the square footage (size) of the store, and the number of competing stores in the area. Our model will be:

$$ \text{predicted net sales} = w_0 + w_1 (\text{square feet}) + w_2(\text{competitors}) $$

Suppose $w_0^*$, $w_1^*$, and $w_2^*$ are our prediction rule's optimal parameters. Do you expect $w_1^*$ to be positive or negative? What about $w_2^*$?

In [12]:
px.scatter(data, x='sq_ft', y='net_sales', title='Net Sales vs. Square Feet')
In [13]:
px.scatter(data, x='competing_stores', y='net_sales', title='Net Sales vs. Competing Stores')

Looking at separate scatter plots only tells part of the story. Let's look at a 3D scatter plot, with one axis for square footage, one axis for competing stores, and one axis for net sales.

In [14]:
fig = go.Figure()
fig.add_trace(go.Scatter3d(x=data['sq_ft'], 
                           y=data['competing_stores'], 
                           z=data['net_sales'], mode='markers'))

fig.update_layout(scene = dict(
    xaxis_title = 'Square Feet',
    yaxis_title = 'Competing Stores',
    zaxis_title = 'Net Sales'))

Our goal is to find the best-fitting plane to this set of points.

Discussion Question¶

At the start of this notebook, we fit a prediction rule with a single feature, square feet, and got that the weight of that feature was $w_1^* = 85.389$.

We are about to fit a prediction rule with two features, square feet and competing stores.

Question: Is the weight of the square feet feature, $w_1^*$, for this new prediction rule guaranteed to be equal to 85.389?

A. Yes

B. No

Answer: No.

Our design matrix is:

$$ \begin{pmatrix} 1 & s_1 & c_1\\ 1 & s_2 & c_2\\ \vdots & \vdots & \vdots\\ 1 & s_n & c_n \end{pmatrix} $$

where $s_i$ is the size of the $i$th store, and $c_i$ is the number of competing stores near the $i$th store. In code:

In [15]:
X_two_feature_model = data[['1', 'sq_ft', 'competing_stores']].values
X_two_feature_model
Out[15]:
array([[ 1.        ,  3.        , 11.        ],
       [ 1.        ,  2.20000005, 12.        ],
       [ 1.        ,  0.5       , 15.        ],
       [ 1.        ,  5.5       ,  1.        ],
       [ 1.        ,  4.4000001 ,  5.        ],
       [ 1.        ,  4.80000019,  4.        ],
       [ 1.        ,  3.0999999 , 10.        ],
       [ 1.        ,  2.5       , 12.        ],
       [ 1.        ,  1.20000005, 15.        ],
       [ 1.        ,  0.60000002,  8.        ],
       [ 1.        ,  5.4000001 ,  1.        ],
       [ 1.        ,  4.19999981,  7.        ],
       [ 1.        ,  4.69999981,  3.        ],
       [ 1.        ,  0.60000002, 14.        ],
       [ 1.        ,  1.20000005, 11.        ],
       [ 1.        ,  1.60000002, 10.        ],
       [ 1.        ,  4.30000019,  4.        ],
       [ 1.        ,  2.5999999 , 13.        ],
       [ 1.        ,  3.79999995,  7.        ],
       [ 1.        ,  5.30000019,  1.        ],
       [ 1.        ,  5.5999999 ,  0.        ],
       [ 1.        ,  0.80000001, 14.        ],
       [ 1.        ,  1.10000002, 12.        ],
       [ 1.        ,  3.5999999 ,  6.        ],
       [ 1.        ,  3.5       ,  5.        ],
       [ 1.        ,  5.0999999 ,  0.        ],
       [ 1.        ,  8.6       ,  8.        ]])

Using the function solve_normal_equations that we already built:

In [16]:
w_two_feature_model = solve_normal_equations(X_two_feature_model, y)
w_two_feature_model
Out[16]:
array([303.49073761,  45.15092186, -21.5851804 ])

This is telling us that the best-fitting plane to this dataset is

$$\text{predicted net sales} = 303.491 + 45.151 (\text{square feet}) - 21.585 (\text{competing stores})$$

Note that the weight of $\text{square feet}$ in this prediction rule is different than the weight of $\text{square feet}$ in the prediction rule that only had one feature!
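We can see this effect on a small synthetic example (made-up numbers, not the sales data): when two features are correlated, the weight on the first feature in a one-feature model absorbs some of the second feature's effect, so it changes once the second feature is added.

```python
import numpy as np

def solve_normal_equations(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical toy data: y depends on both features, and the features are correlated.
x1 = np.array([0.0, 1.0, 2.0, 3.0])
x2 = np.array([0.0, 1.0, 1.0, 2.0])
y_toy = x1 + x2  # exact relationship: y = 0 + 1*x1 + 1*x2

ones = np.ones_like(x1)

# One-feature model: the weight on x1 absorbs some of x2's effect.
w_one = solve_normal_equations(np.column_stack([ones, x1]), y_toy)

# Two-feature model: the weight on x1 is now its "true" coefficient.
w_two = solve_normal_equations(np.column_stack([ones, x1, x2]), y_toy)

print(w_one[1])  # about 1.6
print(w_two[1])  # about 1.0
```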

In [17]:
XX, YY = np.mgrid[1:10:2, 0:16:2]
Z = w_two_feature_model[0] + w_two_feature_model[1] * XX + w_two_feature_model[2] * YY
plane = go.Surface(x=XX, y=YY, z=Z)

fig = go.Figure(data=[plane])
fig.add_trace(go.Scatter3d(x=data['sq_ft'], 
                           y=data['competing_stores'], 
                           z=data['net_sales'], mode='markers', marker = {'color': '#656DF1'}))

fig.update_layout(scene = dict(
    xaxis_title = 'Square Feet',
    yaxis_title = 'Competing Stores',
    zaxis_title = 'Net Sales'))

As before, let's calculate the MSE:

In [18]:
mean_squared_error(X_two_feature_model, y, w_two_feature_model)
Out[18]:
2677.2979832984

Note that this is significantly lower than the MSE of the model with just one feature:

In [19]:
mean_squared_error(X_one_feature_model, y, w_one_feature_model)
Out[19]:
7125.588747368

All features¶

Let's fit a prediction rule using all of the features.

In [20]:
column_order = ['1', 'sq_ft', 'competing_stores', 'inventory', 'advertising', 'district_size']
X_all_features = data[column_order].values
X_all_features
Out[20]:
array([[1.00000000e+00, 3.00000000e+00, 1.10000000e+01, 2.94000000e+02,
        8.19999981e+00, 8.19999981e+00],
       [1.00000000e+00, 2.20000005e+00, 1.20000000e+01, 2.32000000e+02,
        6.90000010e+00, 4.09999990e+00],
       [1.00000000e+00, 5.00000000e-01, 1.50000000e+01, 1.49000000e+02,
        3.00000000e+00, 4.30000019e+00],
       [1.00000000e+00, 5.50000000e+00, 1.00000000e+00, 6.00000000e+02,
        1.20000000e+01, 1.61000004e+01],
       [1.00000000e+00, 4.40000010e+00, 5.00000000e+00, 5.67000000e+02,
        1.06000004e+01, 1.41000004e+01],
       [1.00000000e+00, 4.80000019e+00, 4.00000000e+00, 5.71000000e+02,
        1.18000002e+01, 1.26999998e+01],
       [1.00000000e+00, 3.09999990e+00, 1.00000000e+01, 5.12000000e+02,
        8.10000000e+00, 1.01000004e+01],
       [1.00000000e+00, 2.50000000e+00, 1.20000000e+01, 3.47000000e+02,
        7.69999981e+00, 8.40000000e+00],
       [1.00000000e+00, 1.20000005e+00, 1.50000000e+01, 2.12000000e+02,
        3.29999995e+00, 2.09999990e+00],
       [1.00000000e+00, 6.00000024e-01, 8.00000000e+00, 1.02000000e+02,
        4.90000010e+00, 4.69999981e+00],
       [1.00000000e+00, 5.40000010e+00, 1.00000000e+00, 7.88000000e+02,
        1.73999996e+01, 1.23000002e+01],
       [1.00000000e+00, 4.19999981e+00, 7.00000000e+00, 5.77000000e+02,
        1.05000000e+01, 1.40000000e+01],
       [1.00000000e+00, 4.69999981e+00, 3.00000000e+00, 5.35000000e+02,
        1.13000002e+01, 1.50000000e+01],
       [1.00000000e+00, 6.00000024e-01, 1.40000000e+01, 1.63000000e+02,
        2.50000000e+00, 2.50000000e+00],
       [1.00000000e+00, 1.20000005e+00, 1.10000000e+01, 1.68000000e+02,
        4.69999981e+00, 3.29999995e+00],
       [1.00000000e+00, 1.60000002e+00, 1.00000000e+01, 1.51000000e+02,
        4.59999990e+00, 2.70000005e+00],
       [1.00000000e+00, 4.30000019e+00, 4.00000000e+00, 3.42000000e+02,
        5.50000000e+00, 1.60000000e+01],
       [1.00000000e+00, 2.59999990e+00, 1.30000000e+01, 1.96000000e+02,
        7.19999981e+00, 6.30000019e+00],
       [1.00000000e+00, 3.79999995e+00, 7.00000000e+00, 4.53000000e+02,
        1.03999996e+01, 1.38999996e+01],
       [1.00000000e+00, 5.30000019e+00, 1.00000000e+00, 5.18000000e+02,
        1.15000000e+01, 1.62999992e+01],
       [1.00000000e+00, 5.59999990e+00, 0.00000000e+00, 6.15000000e+02,
        1.23000002e+01, 1.60000000e+01],
       [1.00000000e+00, 8.00000012e-01, 1.40000000e+01, 2.78000000e+02,
        2.79999995e+00, 6.50000000e+00],
       [1.00000000e+00, 1.10000002e+00, 1.20000000e+01, 1.42000000e+02,
        3.09999990e+00, 1.60000002e+00],
       [1.00000000e+00, 3.59999990e+00, 6.00000000e+00, 4.61000000e+02,
        9.60000000e+00, 1.13000002e+01],
       [1.00000000e+00, 3.50000000e+00, 5.00000000e+00, 3.82000000e+02,
        9.80000019e+00, 1.15000000e+01],
       [1.00000000e+00, 5.09999990e+00, 0.00000000e+00, 5.90000000e+02,
        1.20000000e+01, 1.56999998e+01],
       [1.00000000e+00, 8.60000000e+00, 8.00000000e+00, 5.17000000e+02,
        7.00000000e+00, 1.20000000e+01]])
In [21]:
w_all_features = solve_normal_equations(X_all_features, y)
w_all_features
Out[21]:
array([-18.85941416,  16.20157356,  -5.31097141,   0.17463515,
        11.52626903,  13.5803129 ])
In [22]:
for i, feature in enumerate(column_order):
    if feature == '1':
        print(f'intercept:\t{w_all_features[0]:0.3f}')
    else:
        print(f'{feature}:\t{w_all_features[i]:0.3f}')
intercept:	-18.859
sq_ft:	16.202
competing_stores:	-5.311
inventory:	0.175
advertising:	11.526
district_size:	13.580

The MSE of this model is even lower!

In [23]:
mean_squared_error(X_all_features, y, w_all_features)
Out[23]:
242.2744571716

Note that we can't visualize this prediction rule, since we would need to be able to visualize in 6D, but we can still compute this prediction rule's predictions:

In [24]:
X_all_features @ w_all_features
Out[24]:
array([228.54132342, 128.77828635,  28.57159462, 526.67763281,
       438.55165964, 445.8608775 , 298.19289184, 221.33815906,
        24.49589951,  86.4927322 , 568.5255428 , 423.92508128,
       468.73640753,   7.73991561,  70.48898947,  70.01098006,
       369.96817843, 156.99564616, 393.2790273 , 506.07015112,
       538.3281338 ,  88.84240462,  17.48878601, 352.21694784,
       347.13290225, 518.32948881, 411.92035997])

Which feature is most "important"?¶

To compare weights fairly, we should standardize each feature to account for the differences in units and scale between the features.

Question: What would happen if I try to standardize the column of all 1s? 🧐

In [25]:
features = data[column_order].iloc[:, 1:].values
In [26]:
standardized_features = (features - features.mean(axis=0)) / features.std(axis=0)
In [27]:
X_standardized = np.column_stack([
    np.ones(data.shape[0]),
    standardized_features
])
In [28]:
w_standardized = solve_normal_equations(X_standardized, y)
w_standardized
Out[28]:
array([286.57407407,  31.97302867, -25.51529781,  32.76054166,
        42.69274551,  68.49841225])
In [29]:
for i, feature in enumerate(column_order):
    if feature == '1':
        print(f'intercept:\t{w_standardized[0]:0.3f}')
    else:
        print(f'{feature}:\t{w_standardized[i]:0.3f}')
intercept:	286.574
sq_ft:	31.973
competing_stores:	-25.515
inventory:	32.761
advertising:	42.693
district_size:	68.498

The district size appears to have the largest effect on the net sales.

In [30]:
mean_squared_error(X_standardized, y, w_standardized)
Out[30]:
242.2744571716

Note that standardizing has no impact on the actual predictions made by our prediction rule, and hence the MSE – it just makes the weights more interpretable.
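We can confirm this invariance on synthetic data (an illustrative check with made-up values, not the sales data): because standardizing is an invertible linear change of the feature columns, and the intercept column is kept, both design matrices span the same column space, so the fitted predictions are identical.

```python
import numpy as np

def solve_normal_equations(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(42)

# Hypothetical data: 30 observations, 2 features on very different scales.
features = np.column_stack([rng.uniform(0, 5, 30), rng.uniform(0, 1000, 30)])
y_syn = 2 * features[:, 0] + 0.01 * features[:, 1] + rng.normal(0, 1, 30)

ones = np.ones(30)
X_raw = np.column_stack([ones, features])
X_std = np.column_stack([ones, (features - features.mean(axis=0)) / features.std(axis=0)])

pred_raw = X_raw @ solve_normal_equations(X_raw, y_syn)
pred_std = X_std @ solve_normal_equations(X_std, y_syn)

# Same column space, same projection: the predictions (and hence MSE) match.
assert np.allclose(pred_raw, pred_std)
```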


Part 2 – Feature Engineering¶


In [31]:
cars = sns.load_dataset('mpg').dropna()
In [32]:
cars
Out[32]:
mpg cylinders displacement horsepower weight acceleration model_year origin name
0 18.0 8 307.0 130.0 3504 12.0 70 usa chevrolet chevelle malibu
1 15.0 8 350.0 165.0 3693 11.5 70 usa buick skylark 320
2 18.0 8 318.0 150.0 3436 11.0 70 usa plymouth satellite
3 16.0 8 304.0 150.0 3433 12.0 70 usa amc rebel sst
4 17.0 8 302.0 140.0 3449 10.5 70 usa ford torino
... ... ... ... ... ... ... ... ... ...
393 27.0 4 140.0 86.0 2790 15.6 82 usa ford mustang gl
394 44.0 4 97.0 52.0 2130 24.6 82 europe vw pickup
395 32.0 4 135.0 84.0 2295 11.6 82 usa dodge rampage
396 28.0 4 120.0 79.0 2625 18.6 82 usa ford ranger
397 31.0 4 119.0 82.0 2720 19.4 82 usa chevy s-10

392 rows × 9 columns

In [33]:
px.scatter(cars, x='horsepower', y='mpg', title='MPG vs. Horsepower')

A plain linear prediction rule isn't great here – the trend in the scatter plot looks non-linear.

In [34]:
cars['1'] = 1
w_cars_one_feature = solve_normal_equations(cars[['1', 'horsepower']], cars['mpg'])
w_cars_one_feature
Out[34]:
array([39.93586102, -0.15784473])
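In other words, the best-fitting line is

$$\text{predicted MPG} = 39.936 - 0.158 (\text{horsepower})$$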
In [35]:

x_range = np.linspace(40, 220)

fig = go.Figure()
fig.add_trace(go.Scatter(x = cars['horsepower'], y = cars['mpg'], mode = 'markers', name = 'actual'))
fig.add_trace(go.Scatter(x = x_range, 
                         y = w_cars_one_feature[0] + w_cars_one_feature[1] * x_range, 
                         name = 'linear prediction rule', 
                         line=dict(color='red')))

fig.update_layout(xaxis_title = 'Horsepower', yaxis_title = 'MPG')

What if we add $\text{horsepower}^2$ as a feature?

In [36]:
cars['horsepower^2'] = cars['horsepower']**2
In [37]:
cars[['1', 'horsepower', 'horsepower^2']]
Out[37]:
1 horsepower horsepower^2
0 1 130.0 16900.0
1 1 165.0 27225.0
2 1 150.0 22500.0
3 1 150.0 22500.0
4 1 140.0 19600.0
... ... ... ...
393 1 86.0 7396.0
394 1 52.0 2704.0
395 1 84.0 7056.0
396 1 79.0 6241.0
397 1 82.0 6724.0

392 rows × 3 columns

In [38]:
w_cars_squared = solve_normal_equations(cars[['1', 'horsepower', 'horsepower^2']], cars['mpg'])
w_cars_squared
Out[38]:
array([ 5.69000997e+01, -4.66189630e-01,  1.23053610e-03])
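This corresponds to the prediction rule

$$\text{predicted MPG} = 56.900 - 0.466 (\text{horsepower}) + 0.00123 (\text{horsepower})^2$$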

Let's look at the resulting prediction rule.

In [39]:

fig = go.Figure()
fig.add_trace(go.Scatter(x = cars['horsepower'], y = cars['mpg'], mode = 'markers', name = 'actual'))
fig.add_trace(go.Scatter(x = x_range, 
                         y = w_cars_one_feature[0] + w_cars_one_feature[1] * x_range, 
                         name = 'linear prediction rule', 
                         line=dict(color='red')))
fig.add_trace(go.Scatter(x = x_range, 
                         y = w_cars_squared[0] + w_cars_squared[1] * x_range + w_cars_squared[2] * x_range**2, 
                         name = 'quadratic prediction rule', 
                         line=dict(color='#F7CF5D', width=5)))

fig.update_layout(xaxis_title = 'Horsepower', yaxis_title = 'MPG')

Note: this is still a linear prediction rule – it's just not linear in terms of horsepower. It is linear in terms of the parameter vector, $\vec{w}$, which means we can still use the normal equations to find $\vec{w}^*$.
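The same machinery fits any polynomial: each power of the input is just another column in the design matrix. A minimal sketch on synthetic points (illustrative only, not the cars data):

```python
import numpy as np

def solve_normal_equations(X, y):
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical points that lie exactly on y = 2 - 3x + x^2.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_poly = 2 - 3 * x + x**2

# Columns: 1, x, x^2 -- quadratic in x, but still linear in the parameters.
X_poly = np.vander(x, 3, increasing=True)

w_poly = solve_normal_equations(X_poly, y_poly)
print(w_poly)  # recovers [2, -3, 1], up to floating point
```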